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ABSTRACT 

Motivation: Recently, confocal light sheet microscopy has enabled 
high-throughput acquisition of whole mouse brain 3D images at the 
micron scale resolution. This poses the unprecedented challenge of 
creating accurate digital maps of the whole set of cells in a brain. 
Results: We introduce a fast and scalable algorithm for fully auto- 
mated cell identification. We obtained the whole digital map of 
Purkinje cells in mouse cerebellum consisting of a set of 3D cell 
center coordinates. The method is accurate and we estimated an F-i 
measure of 0.96 using 56 representative volumes, totaling 1 .09 GVoxel 
and containing 4138 manually annotated soma centers. 
Availability and implementation: Source code and its documentation 
are available at http://bcfind.dinfo.unifi.it/. The whole pipeline of meth- 
ods is implemented in Python and makes use of Pylearn2 and mod- 
ified parts of Scikit-learn. Brain images are available on request. 
Contact: paolo.frasconi@unifi.it 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION 

Understanding the cytoarchitecture of the mammahan central 
nervous system on a brain-wide scale is becoming a compelling 
need in neuroscience (Kasthuri and Lichtman, 2007; Sporns 
et aL, 2005). In fact, single-neuron projections often span 
through the whole encephalon (Lichtman and Denk, 2011), sup- 
porting functional connection between anatomically distant re- 
gions. Therefore, charting cellular localizations and projections 
throughout the whole brain is a mandatory step to afford a 
comprehensive view of brain function. Many efforts are thus 
devoted to build cellular-resolution, brain-wide neuroanatomical 
atlases of the mouse brain (Bohland et aL, 2009; Kleinfeld et aL, 
2011; Oh et aL, 2014). Such maps would eventually allow char- 
acterizing on a structural basis the physiology and pathology of 
the central nervous system at various stages, ranging from devel- 
opment to neurodegeneration. 

To map the structure of the mouse brain, in the past years 
several high-throughput imaging techniques have been de- 
veloped. Electron microscopy coupled with automatic tissue sec- 
tioning has been exploited to reconstruct neuronal wiring with 
nanometric resolution (Briggman et aL, 201 1; Knott et aL, 2008); 
however, its use is still limited to small brain regions because the 
slow imaging rates makes whole-brain measurements impossible 
at the moment (Briggman and Bock, 2011). On the other hand. 
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optical methods have coarser resolution, but can be used to 
image the entirety of mouse brain (Osten and Margrie, 2013). 

The three main optical approaches used to map mouse brain 
anatomy are micro-optical sectioning tomography (MOST) (Li 
et aL, 2010; Mayerich et aL, 2008), serial two-photon tomog- 
raphy (STP) (Ragan et aL, 2012) and light sheet microscopy 
(LSM) (Keller and Dodt, 2012). The former technique allows 
mouse brain reconstruction with high contrast and resolution 
in 3D, but imaging time can reach even 1 month for a single 
brain (Gong et aL, 2013). STP shows the excellent contrast and 
resolution characteristic of multiphoton microscopy, but it oper- 
ates with rough axial sampling [1 /xm section every 50 /xm 
(Ragan et aL, 2012)] and to our knowledge no full sampling 
reconstruction of a whole mouse brain has been demonstrated 
with this technique. LSM, coupled with chemical clearing pro- 
cedures to render the brain transparent (Becker et aL, 2012; 
Chung et aL, 2013), permits reconstruction of the whole mouse 
brain with micron- scale resolution in a timescale ranging from 
hours to a few days (Dodt et aL, 2007). The contrast affordable 
with this latter method is usually lower than the one of MOST 
and STP, because of residual light scattering inside the cleared 
tissue. However, LSM currently is the only method allowing 
acquiring a significant number of samples with full 3D reso- 
lution. Furthermore, an implementation called confocal light 
sheet microscopy (CLSM) shows 100% contrast increase with 
respect to conventional LSM, allowing to distinguish neuronal 
somata in whole-brain tomographies (Silvestri et aL, 2012). In 
this technique, however, different fixation efficiencies within the 
whole organ and inhomogeneous optical clearing give rise to a 
large variability in contrast throughout the entire volume (as an 
example, three regions are shown at the top of Fig. 3). Because of 
this heterogeneity, naive segmentation or localization methods 
(e.g. thresholding) cannot be apphed to analyze whole-brain 
datasets obtained with CLSM. 

The availability of advanced imaging techniques for whole 
brain mapping introduces the new challenge of extracting quan- 
titative human-readable information from the data 
(Helmstaedter et aL, 2011). There exist several proposals for 
automatic localization or segmentation of cell bodies in 2D 
(Buggenthin et aL, 2013; Navlakha et aL, 2013) and 3D micros- 
copy (Forero et aL, 2010; LaTorre et aL, 2013; Quan et aL, 2013). 
Forero et aL (2010) presented a method based on image filtering 
and object morphology analysis that automatically counts the 
number of dying cells in images of Drosophila embryos collected 
at the confocal microscope. The method was tested on small 
stacks of 130MVoxels and it attained a recall, precision and Fx 
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of 0.98, 0.97 and 0.97, respectively (see Section 3.1.2 for defin- 
itions). LaTorre et al. (2013) propose an algorithm for segment- 
ing neuronal mouse cells in 3D images of somatosensory cortex 
of 14 day old rats collected using a confocal laser scanner. The 
method, which needs information obtained in a 2D segmentation 
stage, was tested on a volume containing, in total, 600-700 neu- 
rons belonging to three different cortical layers (15.4 MVoxels). 
This method achieved a recall, precision and Fx ranging in (0.95, 
0.99), (0.94, 0.95) and (0.95, 0.97), respectively. Quan et al. (2013) 
presented a neuron soma localization method, based on a mini- 
mization problem, which was tested on an image dataset of brain 
coronal profile of transgenic fluorescence mice (2-10 weeks old) 
collected using a fluorescence MOST system. The size of tested 
stack was 1300 x 1850 voxels (361 MVoxels) and the algorithm 
localized ^^2500 neurons with a recall of 0.88. 

In this article, we address the two major challenges that arise 
when attempting to perform information extraction from CLSM 
images: large datasets, and significant contrast heterogeneity. A 
mouse brain has a volume of the order of 1 cm^, yielding image 
sizes in the TeraByte scale at the micron-resolution. In these 
cases, the only alternative to the massive use of manwork [as 
in (Briggman et al, 2011)] is the development of fully automatic 
tools. To achieve this goal, the inherent contrast variability in 
CLSM requires sufficient robustness with respect to the param- 
eters of the extraction algorithms: fine-tuning of parameters on 
different regions [as suggested e.g. by Quan et al. (2013)] may be 
practically unfeasible with images containing hundreds of thou- 
sands of neurons. 

The method presented in this article is based on three core 
algorithmic ideas: mean shift clustering to detect soma centers 
(Section 2.2), supervised semantic deconvolution by means of 
neural networks for image enhancement (Section 2.3) and mani- 
fold learning for filtering false positives (FPs) (Section 2.4). The 
implementation makes use of Pylearn2 (Goodfellow et al, 2013) 
and modified parts of Scikit-learn (Pedregosa et al, 2011). To 
demonstrate its capabilities, we applied the algorithm to localize 
and count the Purkinje cells in the cerebellum of an L7-GFP 
mouse (Tomomura et al, 2001), a transgenic animal in which 
this neuronal population is labeled with enhanced green fluores- 
cent protein (EGFP). We obtained an i^i-measure of 0.96 and an 
area under the recall-precision curve of 0.97. To our knowledge, 
this is the first complete map of a selected neuronal population in 
a large area of the mouse brain. 

2 MATERIALS AND METHODS 

2.1 Materials 

The images used for this study were obtained with CLSM, a method that 
combines the advantages of light sheet illumination with a confocal de- 
tection scheme. The protocol to obtain the images is described in detail in 
(Silvestri et al., 2012). Briefly, brain tissue is fixed with paraformaldehyde 
and subsequently cleared by substitution of water with a refractive-index- 
matching liquid (Becker et al, 2012; Dodt et al, 2007). The clearing 
procedure leads to isotropic tissue shrinkage of ~20% in each direction, 
corresponding to a reduction of ~50% in volume. Transparent brains are 
then imaged with the CLSM apparatus, which produces single-channel 8- 
bit TIFF files. The voxel size of the dataset presented here is 
0.8 X 0.8 X 1 fjLTcc'. To collect the whole volume, many parallel adjacent 
image stacks are acquired by the apparatus. The stacks partially overlap 



with the neighbors, allowing subsequent alignment and fusion via a soft- 
ware tool designed to work with large dataset (Tera Stitcher) (Bria and 
lannello, 2012). Final data are saved as a non-redundant collection of 
non-overlapping stacks; copies of the dataset at lower resolutions are also 
saved, facilitating the visualization and 3D navigation of the whole image 
(Peng et al, 2014). 

The main dataset analyzed is the whole cerebellum of a 10 day old L7- 
GFP mouse (Tomomura et al, 2001). In this transgenic animal, all 
Purkinje cells express EGFP, allowing visualization and mapping of 
this neuronal population. 

2.2 Mean shift clustering 

2.2.1 Substacking. We begin by splitting the whole 3D image into a 
set of relatively small substacks of size W x H x D. Partitioning the 
image has a number of advantages. First, it allows us to approximate a 
local-thresholding procedure (see Section 2.2.3) without incurring in the 
computational cost of fully fledged local thresholding algorithms (Sahoo 
et al, 1988). Second, dividing a large image in several substacks enables 
an immediate multi-core parallel implementation where each substack is 
processed separately in a different thread. Third, it is convenient to work 
on substacks during the manual annotation process (see Section 3.1.1), 
which is necessary to create the ground truth data used to estimate the 
quality of the predictions. 

Substacks need to overlap to avoid border effects in the subsequent 
clustering procedure (see Section 2.2.2). The overlap length M was de- 
signed to ensure that every cell with a center detected inside the substack 
of size (W — M) X (H — M) X (D - M) falls entirely within the substack 
of size W X H X D (Fig. 1). In our images, the visible region of a Purkinje 
soma ranges between 1 1 and 18 voxels in diameter, corresponding to 13 ^ 
22 /xm in the tissue (taking into account the shrinkage introduced by the 
clearing procedure). We therefore used M = 20 in our experiments. Also, 
when W, H, D range in 100-150, substacks are small enough to obtain an 
approximately local binarization threshold and, at the same time, large 
enough to keep the overhead due to the processing of overlapping regions 
within acceptable limits. All the algorithms described below operate in- 
dependently on single substacks. 

2.2.2 Cell identification Clustering is an extremely common segmen- 
tation approach in low-level computer vision and image processing. 
Typically in these applications the goal is to group together pixels sharing 
similar features or colors. Here we propose a different strategy aiming at 
grouping together voxels belonging to the same soma. Our algorithm 
outputs cluster centroids that (ideally) correspond to soma centers. 




W " 



Fig. 1. Overlapping of substacks (depicted in 2D for simplicity). 
Processing is carried out in the region of size Wx H but detected cells 
are only accepted if their centers fall within the region of size {W — M) 
X (H - M) (delimited by dashed lines). Sample accepted and rejected cells 
after processing the central substack are shown as light and dark circles, 
respectively 
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Fig. 2. Illustration of semantic deconvolution: a portion of the original 
image (left), the associated ideal image (middle), image filtered by the 
trained neural network (right). Best viewed by zooming in a computer 
screen 



Because the number of clusters is clearly unknown in our case (because it 
corresponds to the number of cells), we take a non-parametric approach 
with a variant of mean shift (Comaniciu and Meer, 2002). The algorithm 
takes as input two sets of points, L and S, where each point is represented 
by a triplet ^ = (x, y, z) of 3D coordinates. L is the set of voxels whose 
intensity exceeds the background threshold as explained in Section 2.2.3. 
The classic mean shift algorithm would start from all available data 
points, place a kernel on each of them and shift each point toward the 
mean value computed as the kernel- weighted average of the data. In our 
variant, we improve both its running time and its statistical precision by 
starting from a carefully chosen set of seeds S (see Section 2.2 A). Pseudo- 
code of our variant is listed below. 

Cluster(^, L, m, K) 

1 C = 0 

2 for each 'p e S 

3 c = p 

4 repeat 

^ ~c-^J2-^^^m(q)qKCc -q) 

6 until converged 

7 C=CU{c} 

8 return VniqiQ 

In the above code, m is a function returning the intensity of a voxel and K 
the kernel function. In practice we use a spherical kernel: 



K(a)- 



1 ii\\a\\<R 
0 otherwise 



where is a parameter that should be smaller than the expected radius of a 
cell. The normalization factor Z in line 5 is defined as Z= JnCp) 

^peL 

K(c — p) so that c gets assigned to the 'center of mass' of points falling 
within the sphere defined by the kernel function. We use KD-trees 
(Bentley, 1975) to retrieve this set of points. The function Uniq in line 
8 removes near-duplicates from C. 

2.2.3 Thresholding The overall running time of the clustering algo- 
rithm presented in Section 2.2.2 is dominated by time required to answer 
ball queries to the KD-tree, which grows at least as Od^Sllog \L\). For this 
reason, the image is thresholded to get rid of dark voxels, which are 
unlikely to be part of a soma. Thresholding also helps to limit the 
number of false-positive detections. 

We used a multi-threshold version of the maximum entropy approach 
of (Kapur et al., 1985). We set three ranges of voxel intensities and 
computed by maximum entropy the two delimiting thresholds 9i and 
02. The first range [0, ^i] was regarded as background, i.e. dark areas, 
which we assumed to contain no detectable soma. The two other ranges 
were retained as foreground. 



2.2.4 Seeding The set of seeds S is determined as follows. First, we 
extract all local maxima of the image using a 3D max-filter. Second, we 
perform a 3D convolution of the image with a normalized spherical filter 
of size r. Seeds are then all local maxima such that the corresponding 
value in the convolved image is above the binarization threshold 6i deter- 
mined in Section 2.2.3. In other words, we require that the average voxel 
intensity in the ball or radius r centered on a local maximum be above 9i. 

2.3 Supervised semantic deconvolution 

The clustering procedure described above yields good results (details in 
Section 3.2) on image regions where cell somata have high and uniform 
intensity and the contrast on dendritic trunks is modest. Other regions are 
more problematic: if the thresholding and seed selection is too strict, 
faintly visible somata disappear during the preparation of sets S and L, 
leading to false-negative detections; if too loose, then many non-soma 
regions (such as dendritic arbors or axonal bundles) are retained and FPs 
arise. To improve over this intrinsic difficulty, we carried out a prepro- 
cessing stage by applying a non-linear filter trained to boost weak somata 
and decrease the voxel intensities in non-soma regions. This step was 
carried out in a supervised fashion because we believe that the FPs versus 
false negatives (FNs) trade-off can only be properly addressed by intro- 
ducing human knowledge. The goal of semantic deconvolution is not to 
undo the blurring or degradation effects associated with the image acqui- 
sition process (as in classic deconvolution) but rather to enhance and 
standardize the visibility of specific entities of interest in the image 
(somata in our case). We trained a neural network to map the original 
image into an 'ideal' image, which is entirely black except for small white 
spheres positioned at the locations of the true cell somata. In Figure 2 we 
illustrate the concept on a small image portion. In order to smooth the 
neural network targets far away from the somata centers, we actually 
generated the ideal image by first setting the intensity of the central 
voxel to the maximum value and then applying a (non-normalized) 3D 
Gaussian filter with a = 3.5, truncated at 1. 5a. 

We reserved 10 labeled substacks to build a training set. Note that our 
approach does not require us to perform a precise segmentation of cell 
somata: markers at the locations of the true centers (see Section 3.1.1 for 
details of the ground-truth preparation procedure) are sufficient. As a 
consequence, the human effort required to carefully annotate in this way 
the 10 training substacks (0.11% of a whole cerebellum image, 1770 cells 
in total) was modest (~3 h of work). 

The use of neural networks as non-linear convolutional filters for 3D 
images has been proposed before in (Jain et al, 2007) where the goal was 
to recover human drawn cell boundaries in electron microscopy images at 
much higher (20 nm) resolution. In our case, the 10 training substacks 
would total 194 M Voxel, ~400 times the training set used in (Jain et al, 
2007). Additionally, the resulting training set would be highly unbalanced 
because, in our images, the vast majority of voxels fall in dark regions. 
Therefore, rather than performing a full convolution, we sampled ~2 
million training patches ensuring that half of them ('positive' patches) 
overlapped with locations of cell centers and the remaining half ('nega- 
tive' patches) were at least 30 voxels away from the centers and had an 
average gray level above 10. The neural network was trained on small 
cubic patches of size (2^+ 1) x (2^+ 1) x (2^+ 1). In our experiments we 
used 5 = 6, yielding patches of 2197 voxels. The goal is to predict, for 
each voxel, the conditional probability that it falls in a white area of the 
original image. A naive approach would be to use a neural network with 
(25+1)^ inputs and one single output (corresponding to the central 
voxel). However, this approach would have at least two disadvantages. 
First, nearby output voxels are correlated, and predicting them independ- 
ently is not the best choice from a statistical point of view. Second, fil- 
tering a whole volume of size n (assumed to be cubic for simplicity) would 
require time 0{n^s^). Instead, we used a neural network with (2^+1)^ 
outputs. In this way, several adjacent voxels are predicted simultaneously, 
sharing the same feature maps as in a multi-task learning problem 
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(Caruana, 1997). The semantically deconvolved image R is then obtained 
as 

1 ^ 

R(x,y,z)=- Fi+i_^j+i_y^k+\-z{x-i,y-j,z-k) (1) 

i,j,k = -s 

where (x, y, z) is the generic output voxel, Z is a normalization factor and 
F{a, b, c) denotes the 3D patch produced in output by the neural network 
when the input is the 3D patch of the original image centered at coord- 
inates (a, b, c). In this formulation, each output voxel is actually obtained 
by averaging several predictions, which helps to reduce the variance com- 
ponent of the generalization error. Using 3D output patches is also ad- 
vantageous from a computational point of view. First, note that the 
running time of a network with (2^+ 1)^ output is still 0(s^) (in facts it 
just takes twice the time of a network with a single output). Second, 
rather than moving the patch by one voxel, we may move the patch by 
skipping d voxels along each dimension. In this way, the overall running 
time is reduced to 0(n^s^/cP). In our experiments, we used d = 4 with a 
speedup of ~32 with respect to the naive approach. This is significant 
because filtering 120GVoxels takes over a day on a Xeon E5-2665 com- 
puter with 1 6 physical cores, and using the naive approach would require 
more than a month. Note that when using a stride of length d, the nor- 
malization factor Z in Equation 1 actually depends on the test point (x, y, 
z) because not all output voxels are obtained by averaging the same 
number of predictions. 

We used a network with two fully connected hidden layers: 2197 
inputs, 500 and 200, units in the hidden layers, and 2197 outputs (~1.6 
million parameters in total). Preliminary experiments with a third layer 
did not yield appreciable improvements. We used sigmoidal output units, 
which allow us to interpret each output as the conditional probability that 
a certain voxel belongs to a cell soma given the original image patch as 
input. Similarly to (Hinton et al., 2006), we pretrained the first two layers 
in an unsupervised fashion (as Gaussian-binary and binary-binary re- 
stricted Boltzmann machines, respectively). Some of the filters learned by 
the first layer of the network are shown in Supplementary Figure S4. 
Fine-tuning of the overall network was finally performed by backpropa- 
gation, training for ~100 epochs of stochastic gradient descent with mo- 
mentum and with a minibatch size of 10. Altogether, training took 
slightly <2 days on 16 cores. Semantic deconvolution was performed 
on substacks of size {W+2s) x (H+2s) x (D + 2s) to ensure that the 
cell identification subroutines (see Section 2.2) receive data with no 
border effects. 

2.4 Manifold modeling 

The procedure described in this section takes advantage of specific ana- 
tomical background knowledge. In several brain regions, such as in the 
cerebellum, cells are not scattered randomly in the 3D space but are laid 
out in manifolds. For example, the cerebellum cortex folds into folia or 
leaves that can be naturally modeled as manifolds. As it turns out, iso- 
lated or off-manifold centers predicted by the algorithms described above 
are almost invariably false-positive detections. Hence, an effective false- 
positive filter may be designed by estimating the distance of each pre- 
dicted center from the manifold formed by other predicted centers. Our 
approach exploits manifold learning [specifically, the Isomap algorithm 
(Tenenbaum et al., 2000)] and locally weighted regression (Cleveland, 
1979) to obtain such an estimate. 

Because Isomap needs to compute the eigendecomposition of the esti- 
mated geodesic distance matrix from the nearest-neighbors graph, it 
cannot be applied to large set of points. Thus, we begin by partitioning 
predicted centers into smaller subsets. The approach is inspired by a 
computer graphics technique known as chartification (Zhou et al, 
2004). Chartification algorithms, however, are typically designed to 
work on meshes rather than point clouds and they are not robust 
enough to handle the noisy detections that occur in our application. 



We used instead the following procedure. First, we obtained a set of 
seeds by computing the centroid of detections within each substack in 
which the overall image was divided (see Section 2.2.1). Second, starting 
from each seed, we formed a chart by running a uniform cost search on 
the nearest neighbors graph with edges weighted by Euclidean distances, 
proceeding until a predefined geodesic distance from the seed was 
reached. Charts obtained in this way may overlap but this is fine because 
our goal is ultimately to detect FPs. Manifold distances on each chart are 
estimated using the following algorithm: 

Manifold-Filter(C) 

//C= {(x*^'\ z*^'^), /= 1, . . . , /i} is a set of predicted centers 

1 i^=Isomap(C) 

2 f or / = 1 , . . . , ^ 

3 Let = Lowess {H\{{d^ , v^^)], C\{(x(^ /\ z^^) 

4 (^^ = \\f^\u^^ , v(')) - {x^^ , , z^^) II 

5 return 

In the above code, the procedure Isomap takes as input the predicted 
centers and returns their 2D embeddings H= {{u^^\ v^'^), . . . ,n] ob- 
tained by first computing the nearest neighbors graph to obtain estimated 
geodesic distances and then performing multidimensional scaling 
[see (Tenenbaum et al, 2000) for details]. The procedure Lowess learns 
a locally weighted regression model from the 2D coordinates (w, v) 
back to the 3D coordinates (x, y, z). Lowess is a lazy learner, which simply 
stores the training data and at prediction time performs weighted linear re- 
gression. For the sake of completeness we briefly summarize the method 
here. Given the left-out test point v^'^), we first form the matrix 
V G [R"~^^^, whose y-th row is w^^{u^^ , v^^) and where the real- valued 
weights w^^ are given by the Gaussian kernel 

We then form the matrix X g W^""^ whose >th row is w^\x^'^,y^\ z^^). 
Lowess computes its prediction as 

/'(t/('\ v^'^) = (w^'\ v('^)^(V^V)" V^X. (3) 

To reduce the influence of outliers, we finally used the iterative reweight- 
ing approach described in (Cleveland, 1979). Intuitively, J^\u^^\ v^'^) 
reconstructs the 3D coordinates of the /-th center given the other 
centers in the chart. If the /-th center is far from the manifold, then we 
expect the Euclidean distance (see Line 4 in the above algorithm) 
between the true and reconstructed coordinates to be high, yielding a 
sensible criterion for filtering out false-positive detections. As noted 
above, our charts may overlap, meaning that multiple distance values 
are estimated whenever a center appears in multiple charts. In these 
cases, we eventually retained the minimum estimated manifold distance. 
The quality of Isomap embeddings is sensitive to outliers; we thus found 
it beneficial to run the manifold filter iteratively, first with a high dis- 
tance threshold of 40 voxels, to get rid of gross false-positive detections, 
and a second time with a smaller threshold to perform a finer grained 
filtering. 

3 RESULTS AND DISCUSSION 

3.1 Performance evaluation 

3.1.1 Ground truth To estimate the accuracy of the cell detec- 
tion algorithm we annotated soma centers in 56 substacks of a 
cerebellum image, each of size 280 x 282 x 246, for 4,138 mar- 
kers and 1.09 GVoxel (10 additional disjoint substacks were 
marked for training the semantic deconvolution network 
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described in Section 2.3). Each ground truth substack included 
exactly eight adjacent processing substacks. The set of ground 
truth substacks was chosen to cover different cerebellum regions 
and to ensure that the contrast variability in the whole image was 
well represented. Some nearly empty regions were also included 
to better estimate the false-positive rate. Cell centers were located 
with the help of a modified version of the Vaa3D software pack- 
age (Peng et al., 2010). In our version, the one-right-click pin- 
pointing procedure takes advantage of the 3D mean shift 
algorithm described in Section 2.2 but applies it to a cylinder 
whose main axis is defined by the line connecting the observer 
point and the clicked point. Using a fairly small cylinder radius 
(^^6-8 voxels) and rotating the 3D view of the image to avoid 
overlaps, the cylinder will almost always contain just one soma 
and a rehable marker can be assigned in just a few seconds. The 
3D mean shift algorithm also ensures that the marker identifies 
the soma center with good accuracy. Still, the high variability in 
image quality makes hand labeling non-trivial. We found that 
two independent human labelings on nine substacks disagree on 
40 markers of 957. 

3.1.2 Measuring performance For each substack, we compare 
the set of cluster centers C returned by the clustering procedure, 
and the set of ground truth centers G. To properly compare 
predictions against the ground truth we need to ensure that 




Fig. 3. More examples of semantic deconvolution (substacks not 
included in the training set). Top: original images. Bottom: results of 
semantic deconvolution. Best viewed by zooming in a computer screen 



each predicted soma center is uniquely associated with at most 
one ground truth center. For this purpose, we first construct an 
undirected bipartite graph with vertex set C U G. For each pair 
~c e C and g e G we add an edge with weight Wcg = — _| ^ if 

€+\\C-g\\ 

—~g\\<D, being D the expected diameter of a Purkinje soma 
(we set Z) = 16 in our experiments) and e a small constant pre- 
venting numerical overflows. We then compute the maximum 
weight bipartite matching. A predicted center ~c is considered 
to be a true positive (TP) if it is matched to a ground truth 
center g such that \\c — g||<Z)/2. Unmatched predictions are 
counted as FPs and unmatched ground truth centers are counted 
as FNs. We finally compute precision, recall and Fx measure as 
P= TP/(TP + FP), R=TP/{TP + FN) and Fx = j^. To avoid 
the bias due to border effects, we take advantage of the over- 
lapping between substacks (Fig. 1) and exclude from the TP, FP 
or FN counts all points (either predictions or ground truth) fall- 
ing in the outer region of thickness M/2. 

3.2 Mean shift clustering on the raw image 

We ran the algorithm of Section 2.2 on the raw image, with 
different values of the parameters r (radius of the seed ball) 
and R (radius of the kernel). As expected, the algorithm achieves 
its best performance when both parameters are set to a value that 
roughly corresponds to the radius of the smallest somata in the 
image (Fig. 4). Too low values for r generate too many seeds, 
increasing the chances of false-positive detections. Precision is 
also sensitive to the kernel radius because small values of R 
tend to generate multiple detections within the same true soma. 
The slight increase of FNs when increasing R can be explained as 
follows: when two somata are close to each other and almost 
touch, a large kernel drives the algorithm to converge near the 
border between the two somas. When setting r = 6 and R = 5.5, 
the cell detector on raw images attains a precision of 0.76 and a 
recall of 0.71, corresponding to 920 FPs and 1213 FNs. 

3.3 Using semantic deconvolution 

The performance of the mean shift algorithm increases dramat- 
ically when applied to the image cleaned by the semantic decon- 
volution technique described in Section 2.3. Setting r = 6 and 
R= 5.5 yields 493 FPs and just 120 FNs, corresponding to a 
precision of 0.89, a recall of 0.97 and an i^i-measure of 0.93. As 




R 



Fig. 4. Comparing performance of the mean shift algorithm before and after semantic deconvolution and varying the parameters r (seed radius ball) and 
R (kernel size). Left: performance when varying r and fixing R = 6. Right: performance when varying R and fixing r = 6 
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shown in Figure 4, the algorithm is also much less sensitive to the 
choice of R. If r is too small with respect to the expected soma 
radius, many FPs arise. This is because the neural network may 
hallucinate small non-soma Hght regions as soma (one example 
occurs in the leftmost region of the central substack shown in 
Fig. 3). Increasing r beyond six continues to improve precision at 
the expense of recall, but keeping the i^i -measure almost 
constant. 

3.4 Using the manifold filter 

We finally evaluated the effect of the manifold modeling tech- 
nique described in Section 2.4. We first computed the estimated 
manifold distances and filtered all predictions with (f^^>40. This 
step removed some FPs without losing any significant recall. We 
then applied reestimated manifold distances on the remaining 
predictions and computed a recall-precision curve when varying 
the manifold distance threshold. Starting from the set of cells 
detected with r = 6 and = 5.5, we obtained the curve shown 
in Figure 5. The area under this curve is 0.97. As expected, pre- 
cision decreases with the distance threshold, while recall in- 
creases. Still, it is possible to reduce significantly the number of 
FPs without sacrificing recall. Any threshold between 1 1 and 27 
voxels keeps the -measure >0.96. The sensitivity of the overall 
method with respect to r and R is further reduced after the mani- 
fold filter: any value of r and R between 5 and 7 yields an 
7^1 -measure >0.95 if using a distance threshold of 20. With the 
application of the manifold filter (with threshold 20), the algo- 
rithm detected 224222 Purkinje cells in the whole cerebellum 
image (Fig. 6). This number is consistent with previous estimates 
based on stereology (Biamonte et ai, 2009; Woodruff-Pak, 
2006). 

3.5 Discussion 

Quantitative histological measurements are typically restricted to 
small portions of tissue. In fact, on the one hand, conventional 
microscopy techniques are unable to generate large-scale volu- 
metric datasets (Osten and Margrie, 2013). On the other hand, 
currently available algorithms for cell segmentation or localiza- 
tion usually require carefully tuned parameters and therefore 
cannot cope with the image variability that may be present in 
large-scale datasets. The only well-established quantitative 
method to investigate cytoarchitecture on a brain-wide scale is 
stereology (Schmitz and Hof, 2005), which, however, provides 
only estimates of the number of cells, without a precise map of 



their spatial distribution. Furthermore, stereological estimates 
rely on a priori assumptions about the imaged tissue, which 
make the final result dependent on the starting hypothesis 
(Schmitz and Hof, 2005). 

Here, we presented an algorithm for fully automatic detection 
of neuronal soma in CLSM fluorescence images, in which human 
supervision is needed only for the initial training of a neural 
network. After training on a small sample of substacks, the 
neural network is able to generalize well on different brain re- 
gions. This suggests that the network trained on one cell type and 
one brain will be able to perform semantic deconvolution equally 
well for the same cell type of other brains within a uniform 
population of animals. The robustness of the method when 
applied to heterogeneous samples should be further investigated. 
In particular, it might be necessary to collect larger and more 
representative datasets if one wants to detect cells with different 
sizes/shapes or in comparative studies involving animals with 
anatomical variations or disease models. In our experience, the 
overall work devoted to labeling was modest compared with the 
work devoted to sample preparation and image acquisition. 

The capabilities of this algorithm have been demonstrated by 
localizing all the Purkinje neurons in a whole mouse cerebellum. 
The algorithm is robust against the contrast variability in differ- 
ent image regions. The sensitivity of performance with respect to 
the mean shift kernel radius and the manifold filter distance is 
modest (Figs 4 and 5) and the seed selection parameter r can 
be chosen according to the expected size of visible soma. 
One possible future extension to improve our quantitative results 
is to associate a confidence score or a probability to each 
detection. 

Our method obtains the best results when the manifold filter is 
used. This can be a limitation, as the cellular subset under inves- 
tigation might be scattered in all the three dimensions, without 
any apparent uniformity in the spatial distribution. Further, even 
if neurons lie on a manifold in physiological conditions, this re- 
gularity might disappear (at least partly) in presence of a path- 
ology. Thus, if one wants to compare healthy and unhealthy 
subjects, a manifold-independent localization pipeline could pro- 
vide more reliable results. Anyhow, the modeling of the manifold 
can be useful also in this case, allowing a quantitative description 
of the spatial scattering of neurons. 

The combination of the method presented here with genetically 
targeted expression of fluorescent proteins, or with whole-brain 
immunohistochemistry (Chung et al., 2013), will allow precisely 
localizing and counting selected neuronal populations throughout 
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Fig. 5. Effects of the manifold distance filter. Left: recall-precision curve. Right: performance measures as a function of the distance threshold 
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Fig. 6. The final set of predicted Purkinje cell centers as a point cloud 

the entire encephalon, eventually leading to a set of brain-wide 
cytoarchitectonic maps of the various cell types. 

4 CONCLUSION 

We presented an automated pipeline for the localization of neur- 
onal soma in large-scale images obtained with CLSM. The 
method has been validated on images of the cerebellum of an 
L7-GFP mouse. We found that semantic deconvolution signifi- 
cantly boosted performance at a modest cost in terms of hand 
labeling. We obtained an Fi value of 0.96. While some margin 
for improvement may remain, human labeling disagreement sug- 
gests that Fi values >0.98 are unlikely to be attainable. We fur- 
ther demonstrate the algorithm by producing the full map of 
Purkinje cells in the whole mouse cerebellum. 
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